Abstract

Consider the problem of sampling sequentially from a finite number $N \ge 2$ of populations, specified by random variables $X^i_k$, $i = 1,\dots,N$, and $k = 1,2,\dots$; where $X^i_k$ denotes the outcome from population $i$ the $k$-th time it is sampled. It is assumed that for each fixed $i$, $\{X^i_k\}_{k \ge 1}$ is a sequence of i.i.d. normal random variables, with unknown mean $\mu_i$ and unknown variance $\sigma_i^2$. The objective is to have a policy $\pi$ for deciding from which of the $N$ populations to sample at any time $t = 1,2,\dots$ so as to maximize the expected sum of outcomes of $n$ total samples or, equivalently, to minimize the regret due to lack of information about the parameters $\mu_i$ and $\sigma_i^2$. In this paper, we present a simple inflated sample mean (ISM) index policy that is asymptotically optimal in the sense of Theorem 4 below. This resolves a standing open problem from Burnetas and Katehakis (1996b). Additionally, finite horizon regret bounds are given.
1. Introduction and Summary

Consider the problem of a controller sampling sequentially from a finite number $N \ge 2$ of populations or 'bandits', where the measurements from population $i$ are specified by a sequence of i.i.d. random variables $\{X^i_k\}_{k \ge 1}$ taken to be normal with finite mean $\mu_i$ and finite variance $\sigma_i^2$. The means $\{\mu_i\}$ and variances $\{\sigma_i^2\}$ are taken to be unknown to the controller. It is convenient to define the maximum mean, $\mu^* = \max_i \{\mu_i\}$, and the bandit discrepancies $\{\Delta_i\}$ where $\Delta_i = \mu^* - \mu_i \ge 0$. It is additionally convenient to define $\sigma_*^2$ as the minimal variance of any bandit that achieves $\mu^*$, that is $\sigma_*^2 = \min_{i : \mu_i = \mu^*} \sigma_i^2$.

In this paper, given $k$ samples from population $i$ we will take the estimators $\bar{X}^i_k = \sum_{t=1}^k X^i_t / k$ and $S_i^2(k) = \sum_{t=1}^k (X^i_t - \bar{X}^i_k)^2 / k$ for $\mu_i$ and $\sigma_i^2$ respectively. Note that the use of the biased estimator for the variance, with the $1/k$ factor in place of $1/(k-1)$, is largely for aesthetic purposes; the results presented here adapt to the use of the unbiased estimator as well.
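
As a small illustration of the two estimators just defined, the following sketch computes the sample mean and the biased ($1/k$) sample variance for one bandit. The function names and the example data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def sample_mean(x):
    """Sample mean of the k observations from one bandit."""
    x = np.asarray(x, dtype=float)
    return x.sum() / x.size

def biased_variance(x):
    """Biased variance estimator S_i^2(k): divides by k, not k - 1."""
    x = np.asarray(x, dtype=float)
    return ((x - sample_mean(x)) ** 2).sum() / x.size

# Hypothetical observations from a single bandit.
observations = [1.0, 2.0, 3.0, 4.0]
print(sample_mean(observations), biased_variance(observations))
```

The biased estimator coincides with NumPy's default `np.var` (which uses `ddof=0`); the unbiased version would correspond to `ddof=1`.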

1. Substantial portion of the results reported here were derived independently by Cowan and Katehakis, and by Honda 
For any adaptive, non-anticipatory policy $\pi$, $\pi(t) = i$ indicates that the controller samples bandit $i$ at time $t$. Define $T^i_\pi(n) = \sum_{t=1}^n 1\{\pi(t) = i\}$, denoting the number of times bandit $i$ has been sampled during the periods $t = 1,\dots,n$ under policy $\pi$; we take, as a convenience, $T^i_\pi(0) = 0$ for all $i$, $\pi$. The value of a policy $\pi$ is the expected sum of the first $n$ outcomes under $\pi$, which we define to be the function $V_\pi(n)$:

$$V_\pi(n) = E\left[ \sum_{i=1}^N \sum_{k=1}^{T^i_\pi(n)} X^i_k \right] = \sum_{i=1}^N \mu_i E[T^i_\pi(n)], \tag{1}$$

where for simplicity the dependence of $V_\pi(n)$ on the true, unknown, values of the parameters $\mu = (\mu_1,\dots,\mu_N)$ and $\sigma^2 = (\sigma_1^2,\dots,\sigma_N^2)$ is suppressed. The pseudo-regret, or simply regret, of a policy is taken to be the expected loss due to ignorance of the parameters $\mu$ and $\sigma^2$ by the controller. Had the controller complete information, she would at every round activate some bandit $i^*$ such that $\mu_{i^*} = \mu^* = \max_i \{\mu_i\}$. For a given policy $\pi$, we define the expected regret of that policy at time $n$ as

$$R_\pi(n) = n\mu^* - V_\pi(n) = \sum_{i=1}^N \Delta_i E[T^i_\pi(n)]. \tag{2}$$
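
The identity between the two forms of the regret can be checked directly. The sketch below is a minimal illustration; the means and expected sample counts are made-up numbers chosen only so that the counts sum to $n$.

```python
def value(means, expected_counts):
    """V_pi(n): sum over bandits of mu_i * E[T_i(n)]."""
    return sum(m * c for m, c in zip(means, expected_counts))

def regret(means, expected_counts):
    """R_pi(n): sum over bandits of Delta_i * E[T_i(n)]."""
    mu_star = max(means)
    return sum((mu_star - m) * c for m, c in zip(means, expected_counts))

# Hypothetical example: three bandits, n = 100 total samples.
means = [8.0, 7.0, 5.0]
expected_counts = [90.0, 7.0, 3.0]
n = sum(expected_counts)

# Both expressions for the regret agree: n * mu_star - V = sum of Delta_i * E[T_i].
print(n * max(means) - value(means, expected_counts), regret(means, expected_counts))
```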


It follows from Eqs. (1) and (2) that maximization of $V_\pi(n)$ with respect to $\pi$ is equivalent to minimization of $R_\pi(n)$. This type of loss due to ignorance of the means (regret) was first introduced in the context of an $N = 2$ problem by Robbins (1952) as the 'loss per trial' $L_\pi(n)/n = \mu^* - \sum_{i=1}^N \sum_{k=1}^{T^i_\pi(n)} X^i_k / n$ (for which $R_\pi(n) = E[L_\pi(n)]$), constructing a modified (along two sparse sequences) 'play the winner' policy, $\pi_R$, such that $L_{\pi_R}(n) = o(n)$ (a.s.) and $R_{\pi_R}(n) = o(n)$, using for his derivation only the assumption of the Strong Law of Large Numbers. Following Burnetas and Katehakis (1996b), when $n \to \infty$, if $\pi$ is such that $R_\pi(n) = o(n)$ we say policy $\pi$ is uniformly convergent (UC) (since then $\lim_{n \to \infty} V_\pi(n)/n = \mu^*$). However, if under a policy $\pi$, $R_\pi(n)$ grew at a slower pace, such as $R_\pi(n) = o(n^{1/2})$ or better, then the controller would be assured that $\pi$ is making an effective trade-off between exploration and exploitation. It turns out that it is possible to construct 'uniformly fast convergent' (UFC) policies, also known as consistent or strongly consistent, defined as the policies $\pi$ for which:

$$R_\pi(n) = o(n^\alpha), \text{ for all } \alpha > 0, \text{ for all } (\mu, \sigma^2).$$


The existence of UFC policies in the case considered here is well established, e.g., Auer et al. (2002) (Fig. 4 therein) presented the following UFC policy $\pi_{ACF}$:

Policy $\pi_{ACF}$ (UCB1-NORMAL). At each $n = 1,2,\dots$:

i) Sample from any bandit $i$ for which $T^i_\pi(n) < \lceil 8 \ln n \rceil$.

ii) If $T^i_\pi(n) \ge \lceil 8 \ln n \rceil$ for all $i = 1,\dots,N$, sample from bandit $\pi_{ACF}(n+1)$ with

$$\pi_{ACF}(n+1) = \arg\max_i \left\{ \bar{X}^i_{T^i_\pi(n)} + 4 \, S_i(T^i_\pi(n)) \sqrt{\frac{\ln n}{T^i_\pi(n)}} \right\}. \tag{3}$$

(Taking, in this case, $S_i^2(k)$ as the unbiased estimator.)

Additionally, Auer et al. (2002) (in Theorem 4 therein) gave the following bound:

$$R_{\pi_{ACF}}(n) \le M_{ACF}(\mu,\sigma^2) \ln n + C_{ACF}(\mu), \text{ for all } n \text{ and all } (\mu,\sigma^2), \tag{4}$$

with

$$M_{ACF}(\mu,\sigma^2) = 256 \sum_{i : \mu_i \ne \mu^*} \frac{\sigma_i^2}{\Delta_i} + 8 \sum_{i=1}^N \Delta_i, \tag{5}$$

$$C_{ACF}(\mu) = \left(1 + \frac{\pi^2}{2}\right) \sum_{i=1}^N \Delta_i. \tag{6}$$

Ineq. (4) readily implies that $R_{\pi_{ACF}}(n) \le M_{ACF}(\mu,\sigma^2) \ln n + o(\ln n)$. Thus, since $\ln n = o(n^\alpha)$ for all $\alpha > 0$ and $R_{\pi_{ACF}}(n) \ge 0$, it follows that $\pi_{ACF}$ is uniformly fast convergent.


Given that UFC policies exist, the question immediately follows: just how fast can they be? The primary motivation of this paper is the following general result, from Burnetas and Katehakis (1996b), where they showed that for any UFC policy $\pi$, the following holds:

$$\liminf_{n \to \infty} \frac{R_\pi(n)}{\ln n} \ge M_{BK}(\mu,\sigma^2), \text{ for all } (\mu,\sigma^2), \tag{7}$$

where the bound itself is determined by the specific distributions of the populations, in this case

$$M_{BK}(\mu,\sigma^2) = \sum_{i : \mu_i \ne \mu^*} \frac{2 \Delta_i}{\ln\left(1 + \frac{\Delta_i^2}{\sigma_i^2}\right)}. \tag{8}$$

For comparison, depending on the specifics of the bandit distributions, there is a considerable distance between the logarithmic term of the upper bound of Eq. (4) and the lower bound implied by Eq. (7).
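
To make the distance between the two logarithmic coefficients concrete, the sketch below evaluates $M_{BK}(\mu,\sigma^2) = \sum_{i:\mu_i \ne \mu^*} 2\Delta_i / \ln(1 + \Delta_i^2/\sigma_i^2)$ and $M_{ACF}(\mu,\sigma^2) = 256 \sum_{i:\mu_i \ne \mu^*} \sigma_i^2/\Delta_i + 8 \sum_i \Delta_i$ for a given parameter set. The function names and the two-bandit example are illustrative assumptions.

```python
import math

def m_bk(means, variances):
    """Coefficient of ln n in the lower bound, summed over sub-optimal bandits."""
    mu_star = max(means)
    return sum(
        2.0 * (mu_star - m) / math.log1p((mu_star - m) ** 2 / v)
        for m, v in zip(means, variances)
        if m != mu_star
    )

def m_acf(means, variances):
    """Coefficient of ln n in the UCB1-NORMAL upper bound."""
    mu_star = max(means)
    gaps = [mu_star - m for m in means]
    return 256.0 * sum(v / d for d, v in zip(gaps, variances) if d > 0) + 8.0 * sum(gaps)

# Hypothetical two-bandit instance with unit gap and unit variances.
means, variances = [1.0, 0.0], [1.0, 1.0]
print(m_bk(means, variances), m_acf(means, variances))
```

For this instance the lower-bound coefficient is $2/\ln 2$, while the upper-bound coefficient is 264, which illustrates the gap that an asymptotically optimal policy would close.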


The derivation of Ineq. (7) implies that in order to guarantee that a policy is uniformly fast convergent, sub-optimal populations have to be sampled at least a logarithmic number of times. The above bound is a special case of a more general result derived in Burnetas and Katehakis (1996b) (part 1 of Theorem 1 therein) for distributions with multiple unknown parameters (such as in the current problem of normal populations with both the mean and the variance being unknown):

$$M_{BK}(\mu,\sigma^2) = \sum_{i : \mu_i \ne \mu^*} \frac{\Delta_i}{K_i},$$

with $K_i = \inf_{(\mu_i', \sigma_i'^2)} \{ I(f_{(\mu_i,\sigma_i^2)}; f_{(\mu_i',\sigma_i'^2)}) : \mu_i' > \mu^*, \sigma_i'^2 > 0 \} = (1/2) \ln(1 + \Delta_i^2/\sigma_i^2)$, where $I(f; g)$ denotes the Kullback-Leibler divergence between the densities $f$ and $g$.

Previously, Lai and Robbins (1985) had obtained such lower bounds for distributions with one parameter (such as in the current problem of normal populations with unknown mean but known variance). Allocation policies that achieved the lower bounds were called asymptotically efficient or optimal in Lai and Robbins (1985).

Ineq. (7) motivates the definition of a uniformly fast convergent policy $\pi$ as having a uniformly maximal convergence rate (UM), or simply being asymptotically optimal within the class of uniformly fast convergent policies, if $\lim_{n \to \infty} R_\pi(n)/\ln n = M_{BK}(\mu,\sigma^2)$, since then $V_\pi(n) = n\mu^* - M_{BK}(\mu,\sigma^2) \ln n + o(\ln n)$.

Burnetas and Katehakis (1996b) proposed the following index policy $\pi_{BK}$ as one that could achieve this lower bound:
Policy $\pi_{BK}$ (UCB-NORMAL$^0$)

i) For $n = 1,2,\dots,2N$ sample each bandit twice, and

ii) for $n \ge 2N$, sample from bandit $\pi_{BK}(n+1)$ with

$$\pi_{BK}(n+1) = \arg\max_i \left\{ \bar{X}^i_{T^i_\pi(n)} + S_i(T^i_\pi(n)) \sqrt{n^{2/T^i_\pi(n)} - 1} \right\}. \tag{9}$$

Burnetas and Katehakis (1996b) were not able to establish the asymptotic optimality of the $\pi_{BK}$ policy because they were not able to establish a sufficient condition (Condition A3 therein), which we express here as the following equivalent conjecture (the referenced open question in the subtitle).

Conjecture 1 For each $i$, for every $\epsilon > 0$, and for $k \to \infty$, the following is true:

$$P\left( \bar{X}^i_j + S_i(j) \sqrt{k^{2/j} - 1} < \mu_i - \epsilon \text{ for some } 2 \le j \le k \right) = o(1/k). \tag{10}$$


We show that the above conjecture is false (cf. Proposition 6 in the Appendix). This does not imply that $\pi_{BK}$ fails to be UM (i.e., to be asymptotically optimal), but this failure means that the techniques established in Burnetas and Katehakis (1996b) are insufficient to verify its optimality. All is not lost, however. One of the central results of this paper is to establish that with a small change, the policy $\pi_{BK}$ may be modified to one that is provably asymptotically optimal. We introduce in this paper the policy $\pi_{CHK}$ defined in the following way:

Policy $\pi_{CHK}$ (UCB-NORMAL$^2$)

i) For $n = 1,2,\dots,3N$ sample each bandit three times, and

ii) for $n \ge 3N$, sample from bandit $\pi_{CHK}(n+1)$ with

$$\pi_{CHK}(n+1) = \arg\max_i \left\{ \bar{X}^i_{T^i_\pi(n)} + S_i(T^i_\pi(n)) \sqrt{n^{\frac{2}{T^i_\pi(n) - 2}} - 1} \right\}. \tag{11}$$
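
The two indices differ only in the exponent of $n$. The following sketch computes both for a single bandit, given its current sample mean, its (biased) sample standard deviation, the number of times it has been sampled, and the current round. The function and argument names are illustrative assumptions.

```python
import math

def index_bk(mean, std, count, n):
    """Index of policy pi_BK: mean + std * sqrt(n^(2/count) - 1)."""
    return mean + std * math.sqrt(n ** (2.0 / count) - 1.0)

def index_chk(mean, std, count, n):
    """Index of policy pi_CHK: mean + std * sqrt(n^(2/(count - 2)) - 1).

    Requires count >= 3, which the initial sampling phase guarantees.
    """
    if count < 3:
        raise ValueError("pi_CHK needs at least three samples per bandit")
    return mean + std * math.sqrt(n ** (2.0 / (count - 2)) - 1.0)

# Same statistics, same round: the pi_CHK index inflates the mean more.
print(index_bk(0.0, 1.0, 4, 9), index_chk(0.0, 1.0, 4, 9))
```

With four samples at round nine, the inflation term is $\sqrt{9^{1/2} - 1} = \sqrt{2}$ under $\pi_{BK}$ and $\sqrt{9 - 1} = \sqrt{8}$ under $\pi_{CHK}$; the difference shrinks as the sample count grows.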


Remark 1

1) Note that policy $\pi_{CHK}$ is only a slight modification of policy $\pi_{BK}$: the only difference between their indices is the $-2$ in the power of $n$ under the radical, i.e., $2/(T^i_\pi(n) - 2)$ in $\pi_{CHK}(n+1)$ replacing $2/T^i_\pi(n)$ in $\pi_{BK}(n+1)$. This change, while seemingly asymptotically negligible (as in practice $T^i_\pi(n) \to \infty$ (a.s.) with $n$), has a profound effect on what is provable about $\pi_{CHK}$.

2) We note that the indices of policy $\pi_{CHK}$ are a significant modification of those of the optimal allocation policy $\pi_{\sigma^2}$ for the case of normal bandits with known variances, cf. Burnetas and Katehakis (1996b) and Katehakis and Robbins (1995), which are:

$$\pi_{\sigma^2}(n+1) = \arg\max_i \left\{ \bar{X}^i_{T^i_\pi(n)} + \sigma_i \sqrt{\frac{2 \ln n}{T^i_\pi(n)}} \right\},$$

the difference being replacing the term $\sigma_i \sqrt{2 \ln n / T^i_\pi(n)}$ in $\pi_{\sigma^2}$ by $S_i(T^i_\pi(n)) \sqrt{n^{2/(T^i_\pi(n) - 2)} - 1}$ in $\pi_{CHK}$. However, the indices of policy $\pi_{ACF}$ are a minor modification of those of the optimal policy $\pi_{\sigma^2}$, the difference being replacing the term $\sigma_i \sqrt{2 \ln n / T^i_\pi(n)}$ in $\pi_{\sigma^2}$ by $S_i(T^i_\pi(n)) \sqrt{16 \ln n / T^i_\pi(n)}$ in $\pi_{ACF}$.

3) The $\pi_{BK}$ and $\pi_{\sigma^2}$ policies can be seen as connected in the following way, however, observing that $2 \ln n / T^i_\pi(n)$ is a first-order approximation of $n^{2/T^i_\pi(n)} - 1 = e^{2 \ln n / T^i_\pi(n)} - 1$.
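
The first-order relationship between $2 \ln n / T$ and $n^{2/T} - 1 = e^{2 \ln n / T} - 1$ can be seen numerically. The sketch below compares the two quantities for a fixed round $n$ and increasing sample counts $T$; the function names and the chosen values are illustrative assumptions.

```python
import math

def inflation_exact(n, count):
    """n^(2/count) - 1, computed stably as expm1(2 ln n / count)."""
    return math.expm1(2.0 * math.log(n) / count)

def inflation_first_order(n, count):
    """First-order approximation 2 ln n / count."""
    return 2.0 * math.log(n) / count

n = 1000
for count in (5, 50, 5000):
    exact = inflation_exact(n, count)
    approx = inflation_first_order(n, count)
    # The ratio approaches 1 as the sample count grows relative to ln n.
    print(count, exact, approx, exact / approx)
```

Since $e^x - 1 \ge x$ for all $x \ge 0$, the exact term always dominates the approximation, and the two agree only when $T$ is large relative to $\ln n$.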


Following Robbins (1952), and additionally Gittins (1979), Lai and Robbins (1985) and Weber (1992), there is a large literature on versions of this problem, cf. Burnetas and Katehakis (2003), Burnetas and Katehakis (1997b) and references therein. For recent work in this area we refer to Audibert et al. (2009), Auer and Ortner (2010), Gittins et al. (2011), Bubeck and Slivkins (2012), Cappé et al. (2013), Kaufmann (2015), Li et al. (2014), Cowan and Katehakis (2015b), Cowan and Katehakis (2015c), and references therein. For more general dynamic programming extensions we refer to Burnetas and Katehakis (1997a), Butenko et al. (2003), Tewari and Bartlett (2008), Audibert et al. (2009), Littman (2012), Feinberg et al. (2014) and references therein. Other related work in this area includes: Burnetas and Katehakis (1993), Burnetas and Katehakis (1996a), Lagoudakis and Parr (2003), Bartlett and Tewari (2009), Tekin and Liu (2012), Jouini et al. (2009), Dayanik et al. (2013), Filippi et al. (2010), Osband and Van Roy (2014), Denardo et al. (2013).

To our knowledge, outside the work in Lai and Robbins (1985), Burnetas and Katehakis (1996b) and Burnetas and Katehakis (1997a), asymptotically optimal policies have only been developed in Honda and Takemura (2011), and in Honda and Takemura (2010) for the problem of finite known support, where optimal policies, cyclic and randomized, that are simpler to implement than those considered in Burnetas and Katehakis (1996b) were constructed. Recently in Cowan and Katehakis (2015a), an asymptotically optimal policy for uniform bandits of unknown support was constructed. The question of whether asymptotically optimal policies exist in the case discussed herein of normal bandits with unknown means and unknown variances was recently resolved in the positive by Honda and Takemura (2013), who demonstrated that a form of Thompson sampling with certain priors on $(\mu, \sigma^2)$ achieves the asymptotic lower bound $M_{BK}(\mu,\sigma^2)$.

The structure of the rest of the paper is as follows. In Section 2, Theorem 3 establishes a finite horizon bound on the regret of $\pi_{CHK}$. From this bound, it follows that $\pi_{CHK}$ is asymptotically optimal (Theorem 4), and we provide a bound on the remainder term (Theorem 5). Additionally, in Section 3 the Thompson sampling policy of Honda and Takemura (2013) and $\pi_{CHK}$ are compared and discussed, as both achieve asymptotic optimality.




2. The Optimality Theorem and Finite Time Bounds 


The main results of this paper, that Conjecture 1 is false (cf. Proposition 6 in the Appendix), the asymptotic optimality, and the bounds on the behavior of $\pi_{CHK}$, all depend on the following probability bounds; we note that tighter bounds seem possible, but these are sufficient for this paper.

Proposition 2 Let $Z$, $U$ be independent random variables, $Z \sim N(0,1)$ a standard normal, and $U \sim \chi^2_d$ a chi-squared distribution with $d$ degrees of freedom, where $d \ge 2$.

For $\delta > 0$, $p > 0$, the following holds for all $k \ge 1$:

$$\frac{1}{2} P\left( \delta^2 \le U \le \frac{Z^2}{4} \right) k^{-d/p} \le P\left( \delta + \sqrt{U} \sqrt{k^{2/p} - 1} < Z \right) \le \frac{e^{-(1+\delta^2)/2} \, p \, k^{(1-d)/p}}{2 \delta^2 \sqrt{d} \ln k}. \tag{12}$$

Proof [of Proposition 2] The proof is given in the Appendix.
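
A rough Monte Carlo sanity check of bounds of this type can be sketched as follows. The code estimates the middle probability $P(\delta + \sqrt{U}\sqrt{k^{2/p} - 1} < Z)$ and the lower bound $\frac{1}{2} P(\delta^2 \le U \le Z^2/4)\, k^{-d/p}$ by simulation, and evaluates the upper bound $e^{-(1+\delta^2)/2}\, p\, k^{(1-d)/p} / (2\delta^2 \sqrt{d} \ln k)$ in closed form. The function name, the parameter values, the sample size and the seed are all illustrative assumptions, and the lower and middle values are noisy estimates rather than exact quantities.

```python
import numpy as np

def prop2_check(delta, d, p, k, n_samples=200_000, seed=0):
    """Return (lower estimate, middle estimate, exact upper bound), for k > 1."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)   # Z ~ N(0, 1)
    u = rng.chisquare(d, n_samples)      # U ~ chi-squared with d degrees of freedom
    inflation = np.sqrt(k ** (2.0 / p) - 1.0)
    middle = np.mean(delta + np.sqrt(u) * inflation < z)
    lower = 0.5 * np.mean((delta ** 2 <= u) & (u <= z ** 2 / 4.0)) * k ** (-d / p)
    upper = (np.exp(-(1.0 + delta ** 2) / 2.0) * p * k ** ((1.0 - d) / p)
             / (2.0 * delta ** 2 * np.sqrt(d) * np.log(k)))
    return lower, middle, upper

print(prop2_check(delta=1.0, d=2, p=2.0, k=10))
```

The check is only meaningful for $k > 1$, since the upper bound is infinite at $k = 1$.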


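Theorem 3 For policy $\pi_{CHK}$ as defined above, the following bounds hold for all $n \ge 3N$ and all $\epsilon \in (0,1)$:

$$R_{\pi_{CHK}}(n) \le \sum_{i : \mu_i \ne \mu^*} \left( \frac{2 \ln n}{\ln\left(1 + \frac{\Delta_i^2 (1-\epsilon)^2}{\sigma_i^2 (1+\epsilon)}\right)} + \sqrt{\frac{\pi}{2e}} \, \frac{8 \sigma_*^3}{\Delta_i^3 \epsilon^3} \ln \ln n + \frac{8}{\epsilon^2} + \frac{8 \sigma_i^2}{\Delta_i^2 \epsilon^2} + 4 \right) \Delta_i. \tag{13}$$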
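
For a feel of the magnitudes involved, the sketch below evaluates a bound of this form numerically: for each sub-optimal bandit, a leading term $2\ln n / \ln(1 + \Delta_i^2(1-\epsilon)^2/(\sigma_i^2(1+\epsilon)))$, a $\ln\ln n$ term with coefficient $\sqrt{\pi/(2e)} \cdot 8\sigma_*^3/(\Delta_i^3\epsilon^3)$, and the constants $8/\epsilon^2 + 8\sigma_i^2/(\Delta_i^2\epsilon^2) + 4$, all multiplied by $\Delta_i$. The function name and the example parameters are illustrative assumptions.

```python
import math

def chk_regret_bound(means, variances, n, eps):
    """Evaluate the finite horizon bound for given parameters, n >= 3, 0 < eps < 1."""
    mu_star = max(means)
    # Minimal variance among the bandits achieving the maximal mean.
    sigma_star = math.sqrt(min(v for m, v in zip(means, variances) if m == mu_star))
    total = 0.0
    for m, v in zip(means, variances):
        gap = mu_star - m
        if gap == 0:
            continue  # optimal bandits contribute no regret
        leading = 2.0 * math.log(n) / math.log1p(gap ** 2 * (1 - eps) ** 2 / (v * (1 + eps)))
        loglog = (math.sqrt(math.pi / (2.0 * math.e)) * 8.0 * sigma_star ** 3
                  / (gap ** 3 * eps ** 3) * math.log(math.log(n)))
        constants = 8.0 / eps ** 2 + 8.0 * v / (gap ** 2 * eps ** 2) + 4.0
        total += (leading + loglog + constants) * gap
    return total

# Hypothetical two-bandit instance.
print(chk_regret_bound([1.0, 0.0], [1.0, 1.0], n=100, eps=0.5))
```

Smaller values of $\epsilon$ bring the leading coefficient closer to the lower bound but inflate the remaining terms, which is the trade-off exploited when $\epsilon$ is later chosen as a function of $n$.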
Before giving the proof of this bound, we present two results, the first demonstrating the asymptotic optimality of $\pi_{CHK}$, the second giving an $\epsilon$-free version of the above bound, which gives a bound on the sub-logarithmic remainder term. It is worth noting the following. The bounds of Theorem 3 can actually be improved, through the use of a modified version of Proposition 2, to eliminate the $\ln \ln n$ dependence, so the only dependence on $n$ is through the initial $\ln n$ term. The cost of this, however, is a dependence on a larger power of $1/\epsilon$. The particular form of the bound given in Eq. (13) was chosen to simplify the following two results, cf. Remark 4 in the proof of Proposition 2.

Theorem 4 For a policy $\pi_{CHK}$ as defined above, $\pi_{CHK}$ is asymptotically optimal in the sense that

$$\lim_{n \to \infty} \frac{R_{\pi_{CHK}}(n)}{\ln n} = M_{BK}(\mu,\sigma^2). \tag{14}$$

Proof [of Theorem 4] For any $\epsilon$ such that $0 < \epsilon < 1$, we have from Theorem 3 that the following holds:

$$\limsup_{n \to \infty} \frac{R_{\pi_{CHK}}(n)}{\ln n} \le \sum_{i : \mu_i \ne \mu^*} \frac{2 \Delta_i}{\ln\left(1 + \frac{\Delta_i^2 (1-\epsilon)^2}{\sigma_i^2 (1+\epsilon)}\right)}. \tag{15}$$

Taking the infimum over all such $\epsilon$,

$$\limsup_{n \to \infty} \frac{R_{\pi_{CHK}}(n)}{\ln n} \le \sum_{i : \mu_i \ne \mu^*} \frac{2 \Delta_i}{\ln\left(1 + \frac{\Delta_i^2}{\sigma_i^2}\right)} = M_{BK}(\mu,\sigma^2), \tag{16}$$

and observing the lower bound of Eq. (7) completes the result.

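Theorem 5 For a policy $\pi_{CHK}$ as defined above, $R_{\pi_{CHK}}(n) \le M_{BK}(\mu,\sigma^2) \ln n + O((\ln n)^{3/4} \ln \ln n)$, and more concretely

$$R_{\pi_{CHK}}(n) \le M_{BK}(\mu,\sigma^2) \ln n + M_{CHK}(\mu,\sigma^2) (\ln n)^{3/4} \ln \ln n + M'_{CHK}(\mu,\sigma^2) (\ln n)^{3/4} + M''_{CHK}(\mu,\sigma^2) (\ln n)^{1/2} + M'''_{CHK}(\mu), \tag{17}$$

where

$$M_{CHK}(\mu,\sigma^2) = 64 \sqrt{\frac{\pi}{2e}} \, \sigma_*^3 \sum_{i : \mu_i \ne \mu^*} \frac{1}{\Delta_i^2}, \qquad M'_{CHK}(\mu,\sigma^2) = 10 \sum_{i : \mu_i \ne \mu^*} \frac{\Delta_i^3}{(\sigma_i^2 + \Delta_i^2) \ln\left(1 + \frac{\Delta_i^2}{\sigma_i^2}\right)^2},$$

$$M''_{CHK}(\mu,\sigma^2) = 32 \sum_{i : \mu_i \ne \mu^*} \left( \Delta_i + \frac{\sigma_i^2}{\Delta_i} \right), \qquad M'''_{CHK}(\mu) = 4 \sum_{i : \mu_i \ne \mu^*} \Delta_i. \tag{18}$$

While the above bound admittedly has a more complex form than such a bound as in Eq. (4), it demonstrates the asymptotic optimality of the dominating term, and bounds the sub-logarithmic remainder term.

Proof [of Theorem 5] The bound follows directly from Theorem 3, taking $\epsilon = \frac{1}{2} (\ln n)^{-1/4}$ for $n \ge 3$, and observing the following bound, that for $\epsilon$ such that $0 < \epsilon < 1/2$,

$$\frac{1}{\ln\left(1 + \frac{\Delta_i^2 (1-\epsilon)^2}{\sigma_i^2 (1+\epsilon)}\right)} \le \frac{1}{\ln\left(1 + \frac{\Delta_i^2}{\sigma_i^2}\right)} + \frac{10 \Delta_i^2 \, \epsilon}{(\sigma_i^2 + \Delta_i^2) \ln\left(1 + \frac{\Delta_i^2}{\sigma_i^2}\right)^2}. \tag{19}$$

This inequality is proven separately as Proposition 7 in the Appendix.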


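We make no claim that the results of Theorems 3 and 5 are the best achievable for this policy $\pi_{CHK}$. At several points in the proofs, choices of convenience were made in the bounding of terms, and different techniques may yield tighter bounds still. But they are sufficient to demonstrate the asymptotic optimality of $\pi_{CHK}$, and give useful bounds on the growth of $R_{\pi_{CHK}}(n)$.

Proof [of Theorem 3] In this proof, we take $\pi = \pi_{CHK}$ as defined above. For notational convenience, we define the index function

$$u_i(k, j) = \bar{X}^i_j + S_i(j) \sqrt{k^{\frac{2}{j-2}} - 1}. \tag{20}$$

The structure of this proof will be to bound the expected value of $T^i_\pi(n)$ for all sub-optimal bandits $i$, and use this to bound the regret $R_\pi(n)$. The basic techniques follow those in Katehakis and Robbins (1995) for the known variance case, modified accordingly here for the unknown variance case and assisted by the probability bound of Proposition 2. For any $i$ such that $\mu_i \ne \mu^*$, we define the following quantities: let $1 > \epsilon > 0$ and define $\tilde\epsilon = \Delta_i \epsilon / 2$. For $n \ge 3N$,

$$n^i_1(n,\epsilon) = \sum_{t=3N}^n 1\{\pi(t+1) = i, \ u_i(t, T^i_\pi(t)) \ge \mu^* - \tilde\epsilon, \ \bar{X}^i_{T^i_\pi(t)} \le \mu_i + \tilde\epsilon, \ S_i^2(T^i_\pi(t)) \le \sigma_i^2 (1+\epsilon)\}$$

$$n^i_2(n,\epsilon) = \sum_{t=3N}^n 1\{\pi(t+1) = i, \ u_i(t, T^i_\pi(t)) \ge \mu^* - \tilde\epsilon, \ \bar{X}^i_{T^i_\pi(t)} \le \mu_i + \tilde\epsilon, \ S_i^2(T^i_\pi(t)) > \sigma_i^2 (1+\epsilon)\}$$

$$n^i_3(n,\epsilon) = \sum_{t=3N}^n 1\{\pi(t+1) = i, \ u_i(t, T^i_\pi(t)) \ge \mu^* - \tilde\epsilon, \ \bar{X}^i_{T^i_\pi(t)} > \mu_i + \tilde\epsilon\} \tag{21}$$

$$n^i_4(n,\epsilon) = \sum_{t=3N}^n 1\{\pi(t+1) = i, \ u_i(t, T^i_\pi(t)) < \mu^* - \tilde\epsilon\}.$$

Hence, we have the following relationship for $n \ge 3N$, that

$$T^i_\pi(n+1) = 3 + \sum_{t=3N}^n 1\{\pi(t+1) = i\} = 3 + n^i_1(n,\epsilon) + n^i_2(n,\epsilon) + n^i_3(n,\epsilon) + n^i_4(n,\epsilon). \tag{22}$$

The proof proceeds by bounding, in expectation, each of the four terms.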
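Observe that, by the structure of the index function $u_i$,

$$\begin{aligned}
n^i_1(n,\epsilon) &\le \sum_{t=3N}^n 1\left\{\pi(t+1) = i, \ \mu_i + \tilde\epsilon + \sigma_i \sqrt{1+\epsilon} \sqrt{t^{\frac{2}{T^i_\pi(t) - 2}} - 1} \ge \mu^* - \tilde\epsilon \right\} \\
&= \sum_{t=3N}^n 1\left\{\pi(t+1) = i, \ T^i_\pi(t) \le \frac{2 \ln t}{\ln\left(1 + \frac{(\Delta_i - 2\tilde\epsilon)^2}{\sigma_i^2 (1+\epsilon)}\right)} + 2 \right\} \\
&= \sum_{t=3N}^n 1\left\{\pi(t+1) = i, \ T^i_\pi(t) \le \frac{2 \ln t}{\ln\left(1 + \frac{\Delta_i^2 (1-\epsilon)^2}{\sigma_i^2 (1+\epsilon)}\right)} + 2 \right\} \\
&\le \sum_{t=3N}^n 1\left\{\pi(t+1) = i, \ T^i_\pi(t) \le \frac{2 \ln n}{\ln\left(1 + \frac{\Delta_i^2 (1-\epsilon)^2}{\sigma_i^2 (1+\epsilon)}\right)} + 2 \right\} \\
&\le \sum_{t=1}^n 1\left\{\pi(t+1) = i, \ T^i_\pi(t) \le \frac{2 \ln n}{\ln\left(1 + \frac{\Delta_i^2 (1-\epsilon)^2}{\sigma_i^2 (1+\epsilon)}\right)} + 2 \right\} \\
&\le \frac{2 \ln n}{\ln\left(1 + \frac{\Delta_i^2 (1-\epsilon)^2}{\sigma_i^2 (1+\epsilon)}\right)} + 2 + 2.
\end{aligned} \tag{23}$$

The last inequality follows, observing that $T^i_\pi(t)$ may be expressed as the sum of $\pi(t) = i$ indicators, and seeing that the additional condition bounds the number of non-zero terms in the above sum. The additional $+2$ simply accounts for the $\pi(1) = i$ term and the $\pi(n+1) = i$ term. Since the sum from $t = 1$ also counts the initial samples of bandit $i$, the same bound holds for $3 + n^i_1(n,\epsilon)$. Note, this bound is sample-path-wise.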

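For the second term,

$$\begin{aligned}
n^i_2(n,\epsilon) &\le \sum_{t=3N}^n 1\{\pi(t+1) = i, \ S_i^2(T^i_\pi(t)) > \sigma_i^2 (1+\epsilon)\} \\
&= \sum_{t=3N}^n \sum_{k=2}^t 1\{\pi(t+1) = i, \ S_i^2(k) > \sigma_i^2 (1+\epsilon), \ T^i_\pi(t) = k\} \\
&= \sum_{t=3N}^n \sum_{k=2}^t 1\{\pi(t+1) = i, \ T^i_\pi(t) = k\} \, 1\{S_i^2(k) > \sigma_i^2 (1+\epsilon)\} \\
&\le \sum_{k=2}^n 1\{S_i^2(k) > \sigma_i^2 (1+\epsilon)\} \sum_{t=k}^n 1\{\pi(t+1) = i, \ T^i_\pi(t) = k\} \\
&\le \sum_{k=2}^n 1\{S_i^2(k) > \sigma_i^2 (1+\epsilon)\}.
\end{aligned} \tag{24}$$

The last inequality follows as, for fixed $k$, $\{\pi(t+1) = i, T^i_\pi(t) = k\}$ may be true for at most one value of $t$. Recall that $k S_i^2(k)/\sigma_i^2$ has the distribution of a $\chi^2_{k-1}$ random variable. Letting $U_k \sim \chi^2_k$, from the above we have

$$\begin{aligned}
E[n^i_2(n,\epsilon)] &\le \sum_{k=2}^n P\left( S_i^2(k) > \sigma_i^2 (1+\epsilon) \right) \\
&\le \sum_{k=2}^\infty P\left( U_{k-1}/k > 1+\epsilon \right) \\
&\le \sum_{k=2}^\infty P\left( U_{k-1}/(k-1) > 1+\epsilon \right) \\
&= \sum_{k=1}^\infty P\left( U_k > k(1+\epsilon) \right) \\
&\le \frac{1}{e^{\epsilon/2}/\sqrt{1+\epsilon} - 1} \le \frac{8}{\epsilon^2}.
\end{aligned} \tag{25}$$

The penultimate step is a Chernoff bound on the terms, $P(U_k > k(1+\epsilon)) \le (e^{-\epsilon}(1+\epsilon))^{k/2}$, followed by summing the resulting geometric series.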
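
The final step of this argument, bounding the geometric series produced by the Chernoff bound by $8/\epsilon^2$, can be checked numerically. The sketch below sums $\sum_{k \ge 1} r^k$ with $r = \sqrt{1+\epsilon}\, e^{-\epsilon/2}$ in closed form and compares it with $8/\epsilon^2$ on a grid of $\epsilon \in (0,1)$; the function name and the grid are illustrative assumptions.

```python
import math

def chernoff_series(eps):
    """Closed form of sum_{k>=1} r^k with r = sqrt(1 + eps) * exp(-eps / 2)."""
    r = math.sqrt(1.0 + eps) * math.exp(-eps / 2.0)
    return r / (1.0 - r)

for eps in (0.01, 0.1, 0.5, 0.9, 0.99):
    # The series stays below 8 / eps^2 across the whole range of interest.
    print(eps, chernoff_series(eps), 8.0 / eps ** 2)
```

For small $\epsilon$ the series behaves like $4/\epsilon^2$, so the constant 8 leaves some slack but has the correct order.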

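To bound the third term, a similar rearrangement to Eq. (24) (using the sample mean instead of the sample variance) yields:

$$n^i_3(n,\epsilon) \le \sum_{t=3N}^n 1\{\pi(t+1) = i, \ \bar{X}^i_{T^i_\pi(t)} > \mu_i + \tilde\epsilon\} \le \sum_{k=2}^n 1\{\bar{X}^i_k > \mu_i + \tilde\epsilon\}. \tag{26}$$

Recalling that $\bar{X}^i_k - \mu_i \sim Z \sigma_i / \sqrt{k}$ for $Z$ a standard normal,

$$E[n^i_3(n,\epsilon)] \le \sum_{k=2}^n P\left( \bar{X}^i_k > \mu_i + \tilde\epsilon \right) \le \sum_{k=1}^\infty P\left( Z \sigma_i / \sqrt{k} > \tilde\epsilon \right) \le \frac{1}{e^{\tilde\epsilon^2 / (2\sigma_i^2)} - 1} < \frac{2 \sigma_i^2}{\tilde\epsilon^2}. \tag{27}$$

The penultimate step is a Chernoff bound on the terms, $P(Z > \tilde\epsilon \sqrt{k} / \sigma_i) \le e^{-k \tilde\epsilon^2 / (2\sigma_i^2)}$, followed by summing the resulting geometric series.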

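To bound the $n^i_4$ term, observe that in the event $\pi(t+1) = i$, from the structure of the policy it must be true that $u_i(t, T^i_\pi(t)) = \max_j u_j(t, T^j_\pi(t))$. Thus, if $i^*$ is some bandit such that $\mu_{i^*} = \mu^*$, $u_{i^*}(t, T^{i^*}_\pi(t)) \le u_i(t, T^i_\pi(t))$. In particular, we take $i^*$ to be a bandit that not only achieves the maximal mean $\mu^*$, but also the minimal variance among optimal bandits, $\sigma_{i^*}^2 = \sigma_*^2$. We have the following bound,

$$\begin{aligned}
n^i_4(n,\epsilon) &\le \sum_{t=3N}^n 1\{\pi(t+1) = i, \ u_{i^*}(t, T^{i^*}_\pi(t)) < \mu^* - \tilde\epsilon\} \\
&\le \sum_{t=3N}^n 1\{u_{i^*}(t, T^{i^*}_\pi(t)) < \mu^* - \tilde\epsilon\} \\
&\le \sum_{t=3N}^n 1\{u_{i^*}(t, s) < \mu^* - \tilde\epsilon \text{ for some } 3 \le s \le t\}.
\end{aligned} \tag{28}$$

The last step follows as for $t$ in this range, $3 \le T^{i^*}_\pi(t) \le t$. Hence

$$E[n^i_4(n,\epsilon)] \le \sum_{t=3N}^n P\left( u_{i^*}(t, s) < \mu^* - \tilde\epsilon \text{ for some } 3 \le s \le t \right). \tag{29}$$

As an aside, this is essentially the point at which the conjectured Eq. (10) would have come into play for the proof of the optimality of $\pi_{BK}$, bounding the growth of the corresponding term for that policy. We will essentially prove a successful version of that conjecture here. Define the events $A^*_{t,s,\tilde\epsilon} = \{u_{i^*}(t, s) < \mu^* - \tilde\epsilon\}$.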
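Observing the distributions of the sample mean and sample variance, we have (similar to Eq. (41)) for $Z$ a standard normal and $U_{s-1} \sim \chi^2_{s-1}$, with $U_{s-1}$, $Z$ independent,

$$\begin{aligned}
P(A^*_{t,s,\tilde\epsilon}) &= P\left( \frac{\tilde\epsilon}{\sigma_*} \sqrt{s} + \sqrt{U_{s-1}} \sqrt{t^{\frac{2}{s-2}} - 1} < Z \right) \\
&\le \frac{e^{-(\tilde\epsilon/\sigma_*)^2 s/2} (s-2)}{2 (\tilde\epsilon/\sigma_*)^2 \, s \sqrt{e(s-1)}} \cdot \frac{t^{-1}}{\ln t} \\
&\le \left( \frac{1}{2 (\tilde\epsilon/\sigma_*)^2 \sqrt{e}} \right) \frac{e^{-(\tilde\epsilon/\sigma_*)^2 s/2}}{\sqrt{s}} \cdot \frac{t^{-1}}{\ln t},
\end{aligned} \tag{30}$$

where the first inequality follows as an application of Proposition 2 (with $\delta = (\tilde\epsilon/\sigma_*)\sqrt{s}$, $d = s-1$, $p = s-2$, $k = t$), and the second since $s \ge 3$. Applying a union bound to Eq. (29),

$$\begin{aligned}
E[n^i_4(n,\epsilon)] &\le \sum_{t=3N}^n \sum_{s=3}^t P(A^*_{t,s,\tilde\epsilon}) \\
&\le \sum_{t=3N}^n \sum_{s=3}^t \left( \frac{1}{2 (\tilde\epsilon/\sigma_*)^2 \sqrt{e}} \right) \frac{e^{-(\tilde\epsilon/\sigma_*)^2 s/2}}{\sqrt{s}} \cdot \frac{t^{-1}}{\ln t} \\
&\le \left( \frac{1}{2 (\tilde\epsilon/\sigma_*)^2 \sqrt{e}} \right) \int_{s=0}^\infty \frac{e^{-(\tilde\epsilon/\sigma_*)^2 s/2}}{\sqrt{s}} \, ds \int_{t=e}^n \frac{1}{t \ln t} \, dt \\
&= \left( \frac{1}{2 (\tilde\epsilon/\sigma_*)^2 \sqrt{e}} \right) \frac{\sqrt{2\pi}}{(\tilde\epsilon/\sigma_*)} \ln \ln n \\
&= \sqrt{\frac{\pi}{2e}} \, \frac{\sigma_*^3}{\tilde\epsilon^3} \ln \ln n.
\end{aligned} \tag{31}$$

The bounds follow, removing the dependence of the $s$-sum on $t$ by extending it to $\infty$, and bounding the sums by integrals of the (decreasing) summands by slightly extending the range of each. From the above results, and observing that $T^i_\pi(n) \le T^i_\pi(n+1)$, it follows from Eq. (22) that for any $\epsilon$ such that $0 < \epsilon < 1$,

$$\begin{aligned}
E[T^i_\pi(n)] &\le \frac{2 \ln n}{\ln\left(1 + \frac{\Delta_i^2 (1-\epsilon)^2}{\sigma_i^2 (1+\epsilon)}\right)} + 4 + \frac{8}{\epsilon^2} + \frac{2 \sigma_i^2}{\tilde\epsilon^2} + \sqrt{\frac{\pi}{2e}} \, \frac{\sigma_*^3}{\tilde\epsilon^3} \ln \ln n \\
&= \frac{2 \ln n}{\ln\left(1 + \frac{\Delta_i^2 (1-\epsilon)^2}{\sigma_i^2 (1+\epsilon)}\right)} + 4 + \frac{8}{\epsilon^2} + \frac{8 \sigma_i^2}{\Delta_i^2 \epsilon^2} + \sqrt{\frac{\pi}{2e}} \, \frac{8 \sigma_*^3}{\Delta_i^3 \epsilon^3} \ln \ln n.
\end{aligned} \tag{32}$$

The result then follows from the definition of regret in Eq. (2).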


Remark 2 Numerical Regret Comparison: Figure 1 shows the results of a small simulation study done on a set of six populations with means and variances given in Table 1. It provides plots of the regrets when implementing policies $\pi_{CHK}$, $\pi_{ACF}$, and $\pi_G$, a 'greedy' policy that always activates the bandit with the current highest average. Each policy was implemented over a horizon of 100,000 activations, each replicated 10,000 times to produce a good estimate of the average regret $R_\pi(n)$ over the times indicated. The left plot is on the time scale of the first 10,000 activations, and the right is on the full time scale of 100,000 activations.
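
A study of this kind can be sketched in a few lines. The code below is not the authors' simulation code: it is a minimal illustration, assuming a three-samples-per-bandit initial phase, the inflated index $\bar{X} + S \sqrt{n^{2/(T-2)} - 1}$ with the biased variance estimator, and a greedy comparison policy. The function names, the short horizon, the seed, and the assignment of the six means and variances are illustrative assumptions.

```python
import numpy as np

def simulate(means, variances, horizon, policy, rng):
    """Run one sample path and return the number of samples per bandit."""
    means = np.asarray(means, dtype=float)
    sds = np.sqrt(np.asarray(variances, dtype=float))
    n_bandits = len(means)
    counts = np.zeros(n_bandits, dtype=int)
    sums = np.zeros(n_bandits)
    sq_sums = np.zeros(n_bandits)
    for t in range(horizon):              # t = number of samples taken so far
        if t < 3 * n_bandits:             # initial phase: each bandit three times
            i = t % n_bandits
        else:
            avg = sums / counts
            if policy == "chk":
                # Biased variance estimate; clipped at 0 against rounding error.
                var = np.maximum(sq_sums / counts - avg ** 2, 0.0)
                inflation = np.sqrt(float(t) ** (2.0 / (counts - 2)) - 1.0)
                i = int(np.argmax(avg + np.sqrt(var) * inflation))
            else:                         # "greedy": highest current average
                i = int(np.argmax(avg))
        x = rng.normal(means[i], sds[i])
        counts[i] += 1
        sums[i] += x
        sq_sums[i] += x * x
    return counts

def pseudo_regret(means, counts):
    """Sum of Delta_i times the realized sample counts."""
    means = np.asarray(means, dtype=float)
    return float(np.sum((means.max() - means) * counts))

rng = np.random.default_rng(0)
means = [8, 8, 7.9, 7, -1, 0]
variances = [1, 1.4, 0.5, 3, 1, 4]
for policy in ("chk", "greedy"):
    counts = simulate(means, variances, 2000, policy, rng)
    print(policy, counts, pseudo_regret(means, counts))
```

A single short path like this is noisy; estimating the expected regret curves would require averaging over many replications and a longer horizon, as described in the remark.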
  i             1      2      3      4      5      6
  $\mu_i$       8      8      7.9    7      -1     0
  $\sigma_i^2$  1      1.4    0.5    3      1      4

Table 1

[Figure 1 panels: regret $R_\pi(n)$ versus $n$, left and right.]

Figure 1: Numerical Regret Comparison of $\pi_{CHK}$, $\pi_{ACF}$, and $\pi_G$; Left: [0, 10,000] range, Right: [0, 100,000] range.


Remark 3 Bounds and Limits: Figure 2 shows first (left) a comparison of the theoretical bounds on the regret, $B_{\pi_{ACF}}(n)$ and $B_{\pi_{CHK}}(n)$, representing the theoretical regret bounds of the RHS of Eq. (4) and Eq. (13) respectively, taking $\epsilon = (\ln n)^{-1/4}$ in the latter case, for the means and variances indicated in Table 1. Additionally, Figure 2 (right) shows the convergence of $R_{\pi_{CHK}}(n)/\ln n$ to the theoretical lower bound $M_{BK}(\mu,\sigma^2)$.

Figure 2: Left: Plots of $B_{\pi_{ACF}}(n)$ and $B_{\pi_{CHK}}(n)$. Right: Convergence of $R_{\pi_{CHK}}(n)/\ln(n)$ to $M_{BK}(\mu,\sigma^2)$.


3. A Comparison of $\pi_{CHK}$ and Thompson Sampling

Honda and Takemura (2013) proved that for $\alpha < 0$, the following Thompson sampling algorithm is asymptotically optimal, i.e., $\lim_{n \to \infty} R_{\pi_{TS}}(n)/\ln n = M_{BK}(\mu,\sigma^2)$.

Policy $\pi_{TS}$ (TS-NORMAL$^\alpha$)

i) Initially, sample each bandit $\tilde{n} \ge \max(2, 3 - \lfloor 2\alpha \rfloor)$ times.

ii) For $n \ge \tilde{n} N$: For each $i$ generate a random sample $U^i_n$ from a posterior distribution for $\mu_i$, given $(\bar{X}^i_{T^i_\pi(n)}, S_i^2(T^i_\pi(n)))$ and a prior for $(\mu_i, \sigma_i^2) \propto (\sigma_i^2)^{-1-\alpha}$.

iii) Then, take

$$\pi_{TS}(n+1) = \arg\max_i U^i_n. \tag{33}$$


Policies $\pi_{TS}$ and $\pi_{CHK}$ differ decidedly in structure. One key difference is that $\pi_{TS}$ is an inherently randomized policy, while decisions under $\pi_{CHK}$ are completely determined given the bandit results at a given time. Given that both $\pi_{TS}$ and $\pi_{CHK}$ are asymptotically optimal, it is interesting to compare the performances of these two algorithms over finite time horizons, and observe any practical differences between them. To that end, two small simulation studies were done for different sets of bandit parameters $(\mu, \sigma^2)$. In each case, the uniform prior $\alpha = -1$ was used. The simulations were carried out on a 10,000 round time horizon, and replicated sufficiently many times to get good estimates for the expected regret over the times indicated.
Figure 3: Numerical Regret Comparison of $\pi_{CHK}$ and $\pi_{TS}$ for the parameters of Table 1 (left) and Table 2 (right).

Table 2

We observe from the above, and from general sampling of bandit parameters, that $\pi_{TS}$ and $\pi_{CHK}$ generally produce comparable expected regret. A general exploration of random parameters suggests that, on average, $\pi_{TS}$ is slightly superior to $\pi_{CHK}$ in cases where all bandits have roughly equal variances, while $\pi_{CHK}$ has an edge when the optimal bandits have large variance relative to the other bandits and to the size of the bandit discrepancies. It is additionally interesting to note that in the cases pictured above, the superior policy also demonstrated the smaller variance in sample regret (Figure 4). Additional numerical experiments, not pictured here, indicate that the superior policy in each case may exhibit a slightly heavier tail distribution towards larger regret. In general, the question of which policy is superior is largely context specific.
[Figure 4 panels: "Variance of Regret" (left) and "Variance of Regret" (right), horizontal axis from 0 to 10,000 rounds.]

Figure 4: Numerical comparison of variance of sample regret for $\pi_{CHK}$ and $\pi_{TS}$ for different parameters, of Table 1 (left) and Table 2 (right).
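Appendix A. Additional Proofs

Proof [of Proposition 2] Let $P = P\left( \delta + \sqrt{U} \sqrt{k^{2/p} - 1} < Z \right)$. Note immediately, $P \ge P\left( \delta + \sqrt{U} k^{1/p} < Z \right)$. Further,

$$\begin{aligned}
P &\ge P\left( \delta + \sqrt{U} k^{1/p} < Z \text{ and } \sqrt{U} k^{1/p} \ge \delta \right) \\
&\ge P\left( 2 \sqrt{U} k^{1/p} < Z \text{ and } \sqrt{U} k^{1/p} \ge \delta \right) \\
&= \int_{\delta^2 k^{-2/p}}^\infty \int_{2 \sqrt{u} k^{1/p}}^\infty \frac{e^{-z^2/2}}{\sqrt{2\pi}} f_d(u) \, dz \, du,
\end{aligned} \tag{34}$$

where $f_d(u)$ is taken to be the density of a $\chi^2_d$ random variable. Letting $\tilde{u} = k^{2/p} u$,

$$P \ge \int_{\delta^2}^\infty \int_{2 \sqrt{\tilde{u}}}^\infty \frac{e^{-z^2/2}}{\sqrt{2\pi}} f_d\left( \tilde{u} k^{-2/p} \right) k^{-2/p} \, dz \, d\tilde{u}. \tag{35}$$

Observing that $k^{-2/p} \le 1$, so that $e^{-\tilde{u} k^{-2/p}/2} \ge e^{-\tilde{u}/2}$,

$$\begin{aligned}
P &\ge \int_{\delta^2}^\infty \int_{2 \sqrt{\tilde{u}}}^\infty \frac{e^{-z^2/2}}{\sqrt{2\pi}} \cdot \frac{(\tilde{u} k^{-2/p})^{d/2 - 1} e^{-\tilde{u}/2}}{2^{d/2} \Gamma(d/2)} k^{-2/p} \, dz \, d\tilde{u} \\
&= k^{-d/p} \int_{\delta^2}^\infty \int_{2 \sqrt{\tilde{u}}}^\infty \frac{e^{-z^2/2}}{\sqrt{2\pi}} \cdot \frac{\tilde{u}^{d/2 - 1} e^{-\tilde{u}/2}}{2^{d/2} \Gamma(d/2)} \, dz \, d\tilde{u} \\
&= k^{-d/p} P\left( 2 \sqrt{U} < Z \text{ and } U \ge \delta^2 \right) \\
&= \frac{1}{2} k^{-d/p} P\left( 4U \le Z^2 \text{ and } U \ge \delta^2 \right) = \frac{1}{2} k^{-d/p} P\left( \frac{Z^2}{4} \ge U \ge \delta^2 \right).
\end{aligned} \tag{36}$$

The exchange from integral to probability is simply the interpretation of the integrand as the joint pdf of $U$ and $Z$.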

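For the upper bound, we utilize the classic normal tail bound, $P(x < Z) \le e^{-x^2/2} / (x \sqrt{2\pi})$ for $x > 0$:

$$P \le e^{-\delta^2/2} \, E\left[ \frac{e^{-\delta \sqrt{U} \sqrt{k^{2/p} - 1} - \frac{U}{2}(k^{2/p} - 1)}}{\left( \delta + \sqrt{U} \sqrt{k^{2/p} - 1} \right) \sqrt{2\pi}} \right]. \tag{37}$$

Observing the bound that for positive $x$, $e^{-x} \le 1/x$, and recalling that $d \ge 2$,

$$\begin{aligned}
P &\le \frac{e^{-\delta^2/2}}{\delta \sqrt{2\pi}} \, E\left[ \frac{e^{-\frac{U}{2}(k^{2/p} - 1)}}{\delta \sqrt{U} \sqrt{k^{2/p} - 1}} \right] \\
&= \frac{e^{-\delta^2/2}}{\delta^2 \sqrt{2\pi} \sqrt{k^{2/p} - 1}} \, E\left[ U^{-1/2} e^{-\frac{U}{2}(k^{2/p} - 1)} \right] \\
&= \frac{e^{-\delta^2/2}}{\delta^2 \sqrt{2\pi} \sqrt{k^{2/p} - 1}} \cdot \frac{\Gamma(d/2 - 1/2)}{\sqrt{2} \, \Gamma(d/2)} \, k^{(1-d)/p}.
\end{aligned} \tag{38}$$

Here we utilize the following bounds: $e^x - 1 \ge (e/2) x^2$, which is easy to prove, and $\Gamma(d/2 - 1/2)/\Gamma(d/2) \le \sqrt{2\pi/d}$, which may be proved on integer $d \ge 2$ by induction. Applying the first with $x = (2/p) \ln k$, this yields:

$$P \le \frac{e^{-(1+\delta^2)/2} \, p \, k^{(1-d)/p}}{2 \delta^2 \sqrt{d} \ln k}. \tag{39}$$

This completes the proof. ■

Remark 4 Room for Improvement: The choice of the $e^x - 1 \ge (e/2) x^2$ bound above was in fact arbitrary; other bounds, such as those involving alternative powers of $x$, could be used. This would influence how the resulting bound on $P$ is utilized, for instance in the proof of Theorem 3. The use of $e^{-x} \le 1/x$ in Eq. (38) should be considered similarly. ■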


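Proposition 6 Conjecture 1 is false, and for each $i$, for $\epsilon > 0$,

$$\frac{P\left( \bar{X}^i_j + S_i(j) \sqrt{k^{2/j} - 1} < \mu_i - \epsilon \text{ for some } 2 \le j \le k \right)}{1/k} \to \infty \text{ as } k \to \infty. \tag{40}$$

Proof [of Proposition 6] Define the events $A^i_{j,k,\epsilon} = \{\bar{X}^i_j + S_i(j) \sqrt{k^{2/j} - 1} < \mu_i - \epsilon\}$. As the samples are taken to be normally distributed with mean $\mu_i$ and variance $\sigma_i^2$, we have that $\bar{X}^i_j - \mu_i \sim Z \sigma_i / \sqrt{j}$ and $S_i^2(j) \sim \sigma_i^2 U / j$, where $Z$ is a standard normal, $U \sim \chi^2_{j-1}$, and $Z$, $U$ are independent. Hence,

$$P(A^i_{j,k,\epsilon}) = P\left( Z \frac{\sigma_i}{\sqrt{j}} + \sigma_i \sqrt{\frac{U}{j}} \sqrt{k^{2/j} - 1} < -\epsilon \right) = P\left( \frac{\epsilon}{\sigma_i} \sqrt{j} + \sqrt{U} \sqrt{k^{2/j} - 1} < Z \right). \tag{41}$$

The last step is simply a re-arrangement, and an observation on the symmetry of the distribution of $Z$. For $j \ge 3$, we may apply Proposition 2 here for $d = j - 1$, $p = j$, to yield

$$P(A^i_{j,k,\epsilon}) \ge \frac{1}{2} P\left( \frac{\epsilon^2}{\sigma_i^2} j \le U \le \frac{Z^2}{4} \right) k^{-(j-1)/j} = \frac{1}{2} P\left( \frac{\epsilon^2}{\sigma_i^2} j \le U \le \frac{Z^2}{4} \right) \frac{k^{1/j}}{k}. \tag{42}$$

For a fixed $j_0 \ge 3$, for $k \ge j_0$ we have

$$P\left( A^i_{j,k,\epsilon} \text{ for some } 2 \le j \le k \right) \ge P(A^i_{j_0,k,\epsilon}) \ge O(1/k) \, k^{1/j_0}. \tag{43}$$

The proposition follows immediately. ■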


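Proposition 7 For $G > 0$, $0 \le \epsilon \le 1/2$, the following holds:

$$\frac{1}{\ln\left(1 + G \frac{(1-\epsilon)^2}{1+\epsilon}\right)} \le \frac{1}{\ln(1+G)} + \frac{10 G}{(1+G) \ln(1+G)^2} \, \epsilon. \tag{44}$$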
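
An inequality of this shape is easy to probe numerically before reading the proof. The sketch below compares $1/\ln(1 + G(1-\epsilon)^2/(1+\epsilon))$ with the linear bound $1/\ln(1+G) + 10 G \epsilon / ((1+G)\ln(1+G)^2)$ on a grid of $G$ and $\epsilon \in [0, 1/2]$. The function names and the grid are illustrative assumptions, and a grid check is of course not a proof.

```python
import math

def lhs(G, eps):
    """1 / ln(1 + G (1 - eps)^2 / (1 + eps))."""
    return 1.0 / math.log1p(G * (1.0 - eps) ** 2 / (1.0 + eps))

def rhs(G, eps):
    """Linear-in-eps upper bound: 1/ln(1+G) + 10 G eps / ((1+G) ln(1+G)^2)."""
    log_term = math.log1p(G)
    return 1.0 / log_term + 10.0 * G * eps / ((1.0 + G) * log_term ** 2)

worst_slack = min(
    rhs(G, e / 100.0) - lhs(G, e / 100.0)
    for G in (1e-3, 1e-2, 0.1, 1.0, 10.0, 100.0, 1e4)
    for e in range(0, 51)
)
print(worst_slack)  # non-negative (up to rounding) if the bound holds on the grid
```

The two sides coincide at $\epsilon = 0$, and the bound is tightest near $\epsilon = 1/2$ for small $G$, which is consistent with the proof strategy of checking the endpoint and using convexity.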


Proof For any $G > 0$, the function $1/\ln\left(1 + G \frac{(1-\epsilon)^2}{1+\epsilon}\right)$ is positive, increasing, and convex on $\epsilon \in [0,1)$ (Proposition 8). For a given $G > 0$, noting that the above inequality holds (as equality) at $\epsilon = 0$, due to the convexity it suffices to show that the inequality is satisfied at $\epsilon = 1/2$, or

$$\frac{1}{\ln\left(1 + \frac{G}{6}\right)} \le \frac{5G}{(1+G) \ln(1+G)^2} + \frac{1}{\ln(1+G)}. \tag{45}$$

Equivalently, we consider the inequality

$$0 \le \frac{5G}{1+G} + \ln(1+G) - \frac{\ln(1+G)^2}{\ln\left(1 + \frac{G}{6}\right)}. \tag{46}$$

Define the function $F(G)$ to be the RHS of Ineq. (46). Note that as $G \to 0$, $F(G) \to 0$, and in simplified form we have (for $G > 0$ and the limit as $G \to 0$),

$$F'(G) = \frac{\left( (1+G) \ln(1+G) - (6+G) \ln\left(1 + \frac{G}{6}\right) \right)^2}{(1+G)^2 (6+G) \ln\left(1 + \frac{G}{6}\right)^2} \ge 0. \tag{47}$$

It follows that $F(G) \ge 0$, and hence the desired inequality holds at $\epsilon = 1/2$. This completes the proof. ■


Proposition 8 The function $H_G(\epsilon) = 1/\ln\left(1 + G \frac{(1-\epsilon)^2}{1+\epsilon}\right)$ is positive, increasing, and convex in $\epsilon \in [0,1)$, for any constant $G > 0$.


Cowan, Honda and Katehakis 


Proof That $H_G(\epsilon)$ is positive and increasing in $\epsilon$ follows immediately from inspection of $H_G$ and $H'_G$, given the hypotheses on $G$ and $\epsilon$. To demonstrate convexity, by inspection of the terms of $H''_G(\epsilon)$, it suffices to show that for all relevant $G$ and $\epsilon$, the following inequality holds:

$$2G(1-\epsilon)^2 (3+\epsilon)^2 + \left( -8(1+\epsilon) + G(1-\epsilon)^2 (1 + \epsilon(6+\epsilon)) \right) \ln\left(1 + G \frac{(1-\epsilon)^2}{1+\epsilon}\right) \ge 0. \tag{48}$$

Defining $C = G(1-\epsilon)^2/(1+\epsilon)$, it is sufficient to show that for all $C > 0$ and $\epsilon \in [0,1)$ (eliminating a factor of $(1+\epsilon)$ from the above),

$$2C(3+\epsilon)^2 + \left( -8 + C(1 + \epsilon(6+\epsilon)) \right) \ln(1+C) \ge 0. \tag{49}$$

Defining $J_C(\epsilon)$ as the LHS of the above, note that $J'_C(\epsilon) = 2C(3+\epsilon)(2 + \ln(1+C)) > 0$. It suffices then to show $J_C(0) \ge 0$, or $18C + (C-8) \ln(1+C) \ge 0$. Note this holds at $C = 0$, and $\frac{d}{dC}[J_C(0)] = (10 + 19C)/(1+C) + \ln(1+C) > 0$ for $C \ge 0$. Hence, $J_C(\epsilon) \ge 0$, and $H''_G(\epsilon) \ge 0$. ■